{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "We've already seen how to implement a linear regression that uses a single variable to predict the value of another, related variable. When we want to predict the value of a variable from more than one input variable, we need to use matrices.\n", "\n", "In this notebook we'll implement a multivariate linear regression. Here we'll only cover continuous covariates, but the method works identically for categorical covariates; it just requires some extra processing (for example, one-hot encoding) before fitting the model." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generate data for multivariate regression" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 6.00804119 7.32002909 -9.62139603 4.97830887 8.92353483]\n" ] } ], "source": [ "n = 1000 #Number of observations in the full dataset (before the train/test split)\n", "p = 5 #Number of parameters, including intercept\n", "\n", "#Assign true parameters to be estimated\n", "beta = np.random.uniform(-10, 10, p) #Randomly initialise true parameters\n", "print(beta)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = np.random.uniform(0,10,(n,(p-1))) #Continuous features, one column per non-intercept parameter\n", "X0 = np.ones((n,1)) #Column of ones for the intercept\n", "\n", "X = np.concatenate([X0,X], axis = 1) #Join intercept to other variables to form feature matrix\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "Y = np.matmul(X,beta) + np.random.normal(0,10,n) #Linear combination of the features plus a normal error term" ] }, { "cell_type": "code", "execution_count": 5, "metadata": 
{}, "outputs": [], "source": [ "#Concatenate to create dataframe\n", "\n", "dataFeatures = pd.DataFrame(X)\n", "dataFeatures.columns = [f'X{i}' for i in range(p)]\n", "\n", "dataTarget = pd.DataFrame(Y)\n", "dataTarget.columns = ['Y']\n", "\n", "data = pd.concat([dataFeatures, dataTarget], axis = 1)\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>X0</th><th>X1</th><th>X2</th><th>X3</th><th>X4</th><th>Y</th></tr></thead>\n", "<tbody>\n",
"<tr><th>0</th><td>1.0</td><td>2.155360</td><td>9.921219</td><td>0.044325</td><td>0.441540</td><td>-75.531362</td></tr>\n",
"<tr><th>1</th><td>1.0</td><td>8.666179</td><td>7.390977</td><td>2.859317</td><td>0.218063</td><td>29.703580</td></tr>\n",
"<tr><th>2</th><td>1.0</td><td>3.618614</td><td>6.246515</td><td>0.075729</td><td>8.564948</td><td>69.347375</td></tr>\n",
"<tr><th>3</th><td>1.0</td><td>0.425044</td><td>5.063864</td><td>0.974687</td><td>2.666221</td><td>-2.618186</td></tr>\n",
"<tr><th>4</th><td>1.0</td><td>4.601384</td><td>2.430928</td><td>1.119196</td><td>0.165348</td><td>42.026702</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " X0 X1 X2 X3 X4 Y\n", "0 1.0 2.155360 9.921219 0.044325 0.441540 -75.531362\n", "1 1.0 8.666179 7.390977 2.859317 0.218063 29.703580\n", "2 1.0 3.618614 6.246515 0.075729 8.564948 69.347375\n", "3 1.0 0.425044 5.063864 0.974687 2.666221 -2.618186\n", "4 1.0 4.601384 2.430928 1.119196 0.165348 42.026702" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Algebra" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fit a linear regression for a set of features $X$ and a set of targets $Y$, we compute the model parameters as:\n", "\n", "$$\\hat \\beta = (X^TX)^{-1}X^Ty$$\n", "\n", "$\\hat \\beta$ is a $p \\times 1$ vector where each element of the vector corresponds to the estimate of the true parameter which generated the data\n", "\n", "\n", "This estimator is derived using the same ideas as for the single variable case but we have to work with matrices rather than vectors - See [this link](http://home.iitk.ac.in/~shalab/regression/Chapter3-Regression-MultipleLinearRegressionModel.pdf) for a detailed derivation. 
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "class LinearRegressionMultivariate:\n", " \n", " def __init__(self, data, target, features, trainTestRatio = 0.9):\n", " #data - a pandas dataset \n", " #target - the name of the pandas column which contains the true labels\n", " #features - A list containing the names of the columns which we will use to do the regression\n", " #trainTestRatio - the proportion of the entire dataset which we'll use for training\n", " # - the rest will be used for testing\n", " \n", " self.target = target\n", " self.features = features \n", " \n", " #Split up data into a training and testing set\n", " self.train, self.test = train_test_split(data, test_size=1-trainTestRatio)\n", " \n", " \n", " \n", " def fitLR(self):\n", " #Fit a linear regression to the training data\n", " #Useful functions: np.matmul multiplies two matrices together, \n", " #np.transpose returns the transposition of a matrix\n", " #np.linalg.inv returns the inverse of a square matrix\n", " \n", " \n", " \n", " \n", " #Rename train and test data to make the calculation less unpleasant to look at\n", " #Change the data type from pandas dataframe to numpy array\n", " X = np.array(self.train[self.features])\n", " y = np.array(self.train[self.target])\n", " \n", " \n", " #self.betaHat should contain the estimates for the parameters\n", " #Simply a case of implementing the equation above - make sure the matrix dimensions for each term matches up!\n", " self.betaHat = #...\n", " \n", " return 0 #We've saved the parameter values as part of the class now\n", " \n", " def predict(self,x):\n", " #Given a vector (or matrix) of new observations x, predict the corresponding target values\n", " \n", " #This can be done by multiplying x by betaHat - make sure the predictions match up!\n", " \n", " pass\n", " " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "myModel = 
LinearRegressionMultivariate(data, 'Y', [f'X{i}' for i in range(p)])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myModel.fitLR()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Print the model estimates - there should be the right number (p) of them!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 6.07510796 7.41825296 -9.70828959 4.82094549 9.00624211]\n", "(5,)\n" ] } ], "source": [ "print(myModel.betaHat)\n", "print(myModel.betaHat.shape) #==p" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predict values for the test set" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "testPred = myModel.predict(np.array(myModel.test[myModel.features]))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.scatter(myModel.test[myModel.target], testPred)\n", "plt.xlabel = 'True test values'\n", "plt.ylabel = 'Predicted test values'\n", "\n", "#plot line y = x\n", "x = np.arange(np.floor(myModel.test[myModel.target].min()), np.ceil(myModel.test[myModel.target].max()))\n", "plt.plot(x,x,color = 'green')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the points roughly follow the line y = x then that's an indication the model is working well enough" ] } ], "metadata": { "kernelspec": { "display_name": "Python (cgvae)", "language": "python", "name": "cgvae" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 2 }